[VL][TEST] Use Velox's HashTableCache to cache the BHJ's HashTable #12163

Open
JkSelf wants to merge 2 commits into apache:main from JkSelf:gluten-hashtable-cache
@JkSelf JkSelf commented May 28, 2026

What changes are proposed in this pull request?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

@github-actions github-actions Bot added the CORE (works for Gluten Core), BUILD, and VELOX labels May 28, 2026
@github-actions

Run Gluten Clickhouse CI on x86

Copilot AI review requested due to automatic review settings June 5, 2026 08:39
@JkSelf JkSelf force-pushed the gluten-hashtable-cache branch from 63a6382 to 10aec74 June 5, 2026 08:39

github-actions Bot commented Jun 5, 2026

Run Gluten Clickhouse CI on x86

Copilot AI left a comment

Pull request overview

This PR wires Velox’s HashTableCache into Gluten’s Velox broadcast hash join (BHJ) path by stabilizing the build-side hash table identifier across AQE wrappers and propagating Spark execution metadata down into the native runtime to support cache scoping/reuse.

Changes:

  • Canonicalize BHJ build hash table IDs across BroadcastQueryStageExec / ReusedExchangeExec so reuse paths share a stable cache key.
  • Add Spark SQL execution id propagation (Java → JNI → native) and use it in Velox task/query identifiers.
  • Integrate Velox HashTableCache injection/drop in native BHJ build/cleanup, and update Velox-side build relation/cache call sites accordingly.
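The first and third changes above can be illustrated with a minimal sketch. Everything in it is hypothetical: `ToyPlanNode`, `canonicalBuildId`, and `ToyHashTableCache` are illustrative names, not Gluten or Velox APIs, and the real PR does the ID canonicalization in Scala. The sketch only shows the idea that unwrapping AQE-style wrapper nodes yields one stable key, and that a cache keyed by it lets reuse paths share a single built table until it is dropped.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a built join hash table.
struct ToyHashTable {
  std::vector<int> rows;
};

// Hypothetical plan node. A wrapper (for example a reuse or query-stage
// node) points at the node it wraps; the underlying exchange wraps nothing.
struct ToyPlanNode {
  std::string id;
  std::shared_ptr<ToyPlanNode> wrapped; // null for the underlying exchange
};

// Walk through every wrapper so all reuse paths resolve to the same key.
std::string canonicalBuildId(const ToyPlanNode& node) {
  const ToyPlanNode* cur = &node;
  while (cur->wrapped) {
    cur = cur->wrapped.get();
  }
  return cur->id;
}

// Toy cache keyed by the canonical ID. This is NOT Velox's HashTableCache
// interface; it only mirrors the inject/lookup/drop lifecycle.
class ToyHashTableCache {
 public:
  // Insert the table if the key is absent; return the cached instance.
  std::shared_ptr<ToyHashTable> put(
      const std::string& key,
      std::shared_ptr<ToyHashTable> table) {
    std::lock_guard<std::mutex> lock(mutex_);
    return tables_.try_emplace(key, std::move(table)).first->second;
  }

  // Return the cached table, or null if the key is unknown.
  std::shared_ptr<ToyHashTable> get(const std::string& key) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = tables_.find(key);
    if (it == tables_.end()) {
      return nullptr;
    }
    return it->second;
  }

  // Remove the entry; returns true if something was dropped.
  bool drop(const std::string& key) {
    std::lock_guard<std::mutex> lock(mutex_);
    return tables_.erase(key) > 0;
  }

 private:
  mutable std::mutex mutex_;
  std::unordered_map<std::string, std::shared_ptr<ToyHashTable>> tables_;
};
```

In this sketch a second `put` under an existing key returns the table that is already cached, which is the behavior that makes a stable key matter: two join operators that reach the same broadcast through different wrappers end up probing one table instead of building two.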

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| gluten-ut/spark35/src/test/scala/org/apache/spark/sql/execution/adaptive/velox/VeloxAdaptiveQueryExecSuite.scala | Adjusts AQE test conf to avoid an optimizer rule impacting reuse scenarios. |
| gluten-substrait/src/main/scala/org/apache/gluten/execution/JoinExecTransformer.scala | Introduces canonical build hash table ID derivation across AQE wrappers and uses it in join parameters. |
| gluten-arrow/src/main/java/org/apache/gluten/vectorized/PlanEvaluatorJniWrapper.java | Extends JNI kernel creation signature to include Spark execution id. |
| gluten-arrow/src/main/java/org/apache/gluten/vectorized/NativePlanEvaluator.java | Extracts Spark execution id from task local properties and passes it to JNI. |
| ep/build-velox/src/get-velox.sh | Changes default Velox repo/branch selection for builds. |
| cpp/velox/substrait/SubstraitToVeloxPlan.cc | Updates BHJ plan construction to align with Velox-side hash table caching behavior. |
| cpp/velox/jni/VeloxJniWrapper.cc | Injects built hash tables into Velox HashTableCache and updates clone/clear JNI APIs to use cache keys. |
| cpp/velox/compute/WholeStageResultIterator.cc | Uses Spark execution id for Velox task/query identification. |
| cpp/core/jni/JniWrapper.cc | Propagates execution id into native SparkTaskInfo. |
| cpp/core/compute/Runtime.h | Extends SparkTaskInfo with execution id and updates formatting. |
| backends-velox/src/main/scala/org/apache/spark/sql/execution/unsafe/UnsafeColumnarBuildSideRelation.scala | Updates hash table “clone” call to pass cache key. |
| backends-velox/src/main/scala/org/apache/spark/sql/execution/ColumnarBuildSideRelation.scala | Updates hash table “clone” call to pass cache key. |
| backends-velox/src/main/scala/org/apache/gluten/execution/VeloxBroadcastBuildSideCache.scala | Updates hash table clear to drop by cache key. |
| backends-velox/src/main/scala/org/apache/gluten/execution/HashJoinExecTransformer.scala | Uses canonical build hash table ID for broadcast-table resource tracking and reuse. |
| backends-velox/src/main/java/org/apache/gluten/vectorized/HashJoinBuilder.java | Updates native API signatures to include cache key for clone/clear. |

Comment on lines 123 to 129

      std::unordered_set<velox::core::PlanNodeId> emptySet;
      velox::core::PlanFragment planFragment{planNode, velox::core::ExecutionStrategy::kUngrouped, 1, emptySet};
      std::shared_ptr<velox::core::QueryCtx> queryCtx = createNewVeloxQueryCtx();
      task_ = velox::exec::Task::create(
    -     fmt::format(
    -         "Gluten_Stage_{}_TID_{}_VTID_{}",
    -         std::to_string(taskInfo_.stageId),
    -         std::to_string(taskInfo_.taskId),
    -         std::to_string(taskInfo.vId)),
    +     getVeloxTaskId(taskInfo_),
          std::move(planFragment),
          0,

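The body of `getVeloxTaskId` is not shown on this page. As a rough sketch only, assuming it extends the previous `Gluten_Stage_{}_TID_{}_VTID_{}` format with the Spark execution id, a helper of this kind could look like the following. `ToyTaskInfo`, `makeTaskId`, the `executionId` field name, and the resulting string layout are all hypothetical and are not taken from the PR.

```cpp
#include <cstdint>
#include <string>

// Hypothetical subset of the task metadata. stageId, taskId and vId appear
// in the snippet above; the executionId field name is an assumption.
struct ToyTaskInfo {
  int32_t stageId;
  int64_t taskId;
  int64_t vId;
  int64_t executionId; // assumed name for the Spark SQL execution id
};

// Build a task identifier that leads with the execution id, so that all
// tasks of one query share a common prefix. The layout is illustrative.
std::string makeTaskId(const ToyTaskInfo& info) {
  return "Gluten_Execution_" + std::to_string(info.executionId) +
      "_Stage_" + std::to_string(info.stageId) +
      "_TID_" + std::to_string(info.taskId) +
      "_VTID_" + std::to_string(info.vId);
}
```

Leading with the execution id is one plausible way to let a cache scope its entries per query, which matches the stated goal of supporting cache scoping and reuse; the actual helper may order or name the parts differently.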
Comment on lines 19 to +22

      CURRENT_DIR=$(cd "$(dirname "$BASH_SOURCE")"; pwd)
    - VELOX_REPO=https://github.com/IBM/velox.git
    - VELOX_BRANCH=dft-2026_06_04
    - VELOX_ENHANCED_BRANCH=ibm-2026_06_04
    + VELOX_REPO=https://github.com/JkSelf/velox.git
    + VELOX_BRANCH=dft-2026_06_04-hashtable-cache
    + VELOX_ENHANCED_BRANCH=ibm-2026_06_04-hashtable-cache

@JkSelf JkSelf force-pushed the gluten-hashtable-cache branch from 10aec74 to 581f7e5 June 5, 2026 08:56

github-actions Bot commented Jun 5, 2026

Run Gluten Clickhouse CI on x86

Labels

BUILD, CORE (works for Gluten Core), VELOX

2 participants